Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) #1260
Open
dexhunter wants to merge 1 commit into openai:main from …
Conversation
…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:

- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4, 5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats. All seeds under 16 MB (max: 15,981,324 bytes). No TTT, no SLOT, no eval-time adaptation.
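The depth-recurrence idea above (reapplying one MLP at two depths so the model gets extra compute at zero parameter cost) can be sketched as follows. This is a minimal toy illustration, not the PR's actual code; the layer sizes and helper names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256

def make_mlp_params():
    """One toy MLP block: up-projection + down-projection."""
    return {
        "w_up": rng.standard_normal((d_model, d_hidden)) * 0.02,
        "w_down": rng.standard_normal((d_hidden, d_model)) * 0.02,
    }

def mlp(x, p):
    # Toy MLP for illustration: x @ W_up -> relu -> @ W_down
    return np.maximum(x @ p["w_up"], 0.0) @ p["w_down"]

# Baseline: 6 layers, each with its own MLP parameters.
baseline = [make_mlp_params() for _ in range(6)]

# Depth recurrence: layers 4 and 5 point at the SAME parameter dict,
# so those weights are applied twice but stored (and counted) once.
shared = make_mlp_params()
recurrent = baseline[:4] + [shared, shared]

def n_params(layers):
    seen, total = set(), 0
    for p in layers:
        if id(p) in seen:        # a shared dict counts only once
            continue
        seen.add(id(p))
        total += sum(w.size for w in p.values())
    return total

print(n_params(baseline))   # 6 blocks of parameters
print(n_params(recurrent))  # 5 blocks: same depth, one block of params saved
```

The point of the sketch: the recurrent stack runs the same number of forward passes through the MLP as the baseline, but stores one fewer block of weights, which is where the "zero extra params" framing comes from.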
Great submission @dexhunter! Did you happen to test Muon column norm or row+column norm? I found R+C worked best with the smaller vocab, and I'm wondering whether that holds here as well.
HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request on Apr 2, 2026:
Approaches revamped (old eval-only approaches removed):

- 01: Low-Rank Factored MLP (18 layers in 16 MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training-loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:

- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, the frontier is ~1.09 BPB (PR openai#1260).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Omrigotlieb added a commit to Omrigotlieb/parameter-golf that referenced this pull request on Apr 3, 2026:
Row-normalize the gradient update before Newton-Schulz orthogonalization. From PR openai#1260: ~0.001 BPB free improvement, zero extra parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 3, 2026:
… (3-seed mean)

Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer) with a smaller mini runner (21,396 bytes) that creates enough headroom.

3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7). All seeds under 16 MB (max: 15,996,591 bytes). No TTT, no SLOT, no eval-time adaptation.

Techniques: MuonEq-R optimizer, depth recurrence (layers 4, 5 shared MLP), 61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression. Built on PR openai#1218 by @clarkkev.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 3, 2026:
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337). All seeds under 16 MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
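The int6-vs-int5 trade-off these commits negotiate can be illustrated with a toy symmetric uniform quantizer. This is a sketch under simplifying assumptions: real GPTQ uses Hessian-weighted rounding with error feedback, not plain round-to-nearest, but the bit-width/error/size trade shown here is the same one the headroom accounting is about.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(42)
w = rng.standard_normal(4096)             # stand-in for one weight tensor

for bits in (5, 6):
    q, s = quantize(w, bits)
    err = np.sqrt(np.mean((dequantize(q, s) - w) ** 2))
    raw_bytes = bits * w.size / 8
    print(f"int{bits}: rms error {err:.4f}, {raw_bytes:.0f} bytes raw")
```

Each extra bit roughly halves the quantization step, so int6 gives noticeably lower reconstruction error than int5 at a 20% raw-size premium; Hessian sensitivity ranking decides which layers are worth that premium within the 16 MB budget.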
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request on Apr 7, 2026:
…on for Muon optimizer

From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration" (Mar 2026). Used in 40+ openai/parameter-golf PRs; top record PR openai#1260 = val_bpb 1.0929 (3-seed mean).

Inserts row normalization between the Patch 17 Mousse block and Newton-Schulz:

    row_norm[i] = sqrt(sum_j G[i,j]^2)
    G[i,j] = G[i,j] / row_norm[i]

Distinct from Mousse: Mousse is row+col (G/||row||/||col||), MuonEq-R is row-only (G/||row||). They can stack independently. Gated by USE_MUONEQ_R=1; falls back gracefully when unset.

4 MR experiments queued for validation: MR0_alone, MR1_plus_leaky_ng, MR2_seed42, MR3_mousse_plus_muoneqr.

This is the second optimizer-side patch in two fires. Both patches fit our train_loss metric, so they can be validated on the cheap GPU loop without H100 escalation. If either lands within the champion noise band 3.27-3.30, it is a defensible ship for the final stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
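The formulas in this commit message can be written out directly. Below is a minimal sketch of the row-only MuonEq-R step and, for contrast, a Mousse-style row+col equilibration; the function names are illustrative and this is not the repo's actual patch.

```python
import numpy as np

def muoneq_r(G, eps=1e-8):
    """MuonEq-R (row-only): G[i,j] /= sqrt(sum_j G[i,j]^2),
    the formula from the commit message."""
    row_norm = np.sqrt((G * G).sum(axis=1, keepdims=True))
    return G / (row_norm + eps)

def mousse_rc(G, eps=1e-8):
    """Mousse-style row+col equilibration (G/||row||/||col||) —
    a sketch of the distinction drawn above, not the repo's code."""
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)
    return G / (np.linalg.norm(G, axis=0, keepdims=True) + eps)

G = np.random.default_rng(0).standard_normal((8, 16))

# After row-only normalization every row has unit L2 norm:
G_r = muoneq_r(G)
print(np.linalg.norm(G_r, axis=1))   # all ~1.0

# The two stack independently; per the commit, the row norm is inserted
# AFTER the Mousse block and BEFORE Newton-Schulz orthogonalization:
G_stacked = muoneq_r(mousse_rc(G))
```

Equilibrating row scales before Newton-Schulz keeps any single large-norm row from dominating the iteration's initial scaling, which is the "balancing before orthogonalization" the paper title refers to.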
Summary
Key Innovations
Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
Changes from PR #1218
Credits
Test plan